Case Study 02: How can a wellness company play it smart?¶

Author: Vinícius Alves

Date: 03/31/2023

Version: 1.0

About the Company: Bellabeat¶

Bellabeat is a high-tech company that manufactures health-focused smart products. The company is focused on women and develop gadgets to collect data on activity, sleep, stress, and reproductive health. That has allowed Bellabeat to empower women with the knowledge about their own health and habits.

Products¶

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

The analysis¶

Ask phase¶

To help Bellabeat improve a product, a study was commissioned on user behavior using another company's healthy data collection gadgets.

The questions raised in this analysis were:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Prepare phase:¶

The data used¶

The data used in this analysis is from Fitbit's gadgets available on Kaggle.

This dataset was collected by a distributed survey via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The data does not have any personal information of the users.

This data is open-source and have the CC0: Public Domain License, which means anyone can copy, modify, distribute, work with it, even for commercial purposes, without asking permission.

How was the data distributed?¶

The dataset is composed of 18 .csv files. 15 in long format, 3 in wide format.

Bias¶

About bias in the data, I think hat some points need to be discussed:

About the Sampe size:¶

The data has 33 people data in the majority of the shared datasets, but 33 people is just a piece of the entire population that uses smart gadgets. On some archives we just have 24 or 8 people information. Because of that we cannot say that our data is not biased.

About the missing values:¶

17 of the 18 archives have no missing information, but in the 'weightLogInfo_merged.csv' we only have Fat information of two people. Thus, that information is not supposed to be used.

About confounding variables:¶

Most of the datasets have readible understandable variables, but, the 'minuteSleep_merged.csv' has a variable called 'value' that its not clear what is. Maybe it is the minute slept, which in this case, the value 1 make sense, but in some points it has a value of 2 or 3.

After checking Fitabase Data Dictionary , these values can be translate in: 1 = asleep, 2 = restless, 3 = awake.

About the selection and measurement bias:¶

As this dataset was constructed with the consentiment of a few people that used these gadgets, it is already biased, because they knew that their data was being collected. So, we already know that we have the bias of type of people that used some gadgets and agreed to share the data, and about the change in the behavior in the period that the data was collected?

In [2]:
# Libraries
import pandas as pd
import numpy as np
import statistics as st

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import os

The data source archives used in this study are listed below with some data analysis

In [3]:
# Getting the archives
import fnmatch

extensions = ['*.csv']

archives = []

folder_path = "Data_Coursera_CaseStudy02/"


# Walk through the folder and its subdirectories to find CSV files
for root, dirs, files in os.walk(folder_path):
    for csv_extension in extensions:
        for filename in fnmatch.filter(files, csv_extension):
            csv_path = os.path.join(root, filename)
            archives.append(csv_path)

archives
Out[3]:
['Data_Coursera_CaseStudy02/dailyActivity_merged.csv',
 'Data_Coursera_CaseStudy02/dailyCalories_merged.csv',
 'Data_Coursera_CaseStudy02/dailyIntensities_merged.csv',
 'Data_Coursera_CaseStudy02/dailySteps_merged.csv',
 'Data_Coursera_CaseStudy02/heartrate_seconds_merged.csv',
 'Data_Coursera_CaseStudy02/hourlyCalories_merged.csv',
 'Data_Coursera_CaseStudy02/hourlyIntensities_merged.csv',
 'Data_Coursera_CaseStudy02/hourlySteps_merged.csv',
 'Data_Coursera_CaseStudy02/minuteCaloriesNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteCaloriesWide_merged.csv',
 'Data_Coursera_CaseStudy02/minuteIntensitiesNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteIntensitiesWide_merged.csv',
 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteSleep_merged.csv',
 'Data_Coursera_CaseStudy02/minuteStepsNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteStepsWide_merged.csv',
 'Data_Coursera_CaseStudy02/sleepDay_merged.csv',
 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv']
In [4]:
# Checking the data for bias:


# 1 - Checking the sample size

# For every dataset:
# Check the number of ID (people)
# Check the total line count
# Check the missig values

for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Ids: {len(new_df.Id.unique())}')
  print(f'Total line count: {len(new_df)}')
  print(f'Missing values: {new_df.isnull().any(axis=1).sum()}')
  print("-------------------------------------------")
archive:  dailyActivity_merged.csv
Ids: 33
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  dailyCalories_merged.csv
Ids: 33
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailyIntensities_merged.csv
Ids: 33
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailySteps_merged.csv
Ids: 33
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Ids: 14
Total line count: 2483658
Missing values: 0
-------------------------------------------
archive:  hourlyCalories_merged.csv
Ids: 33
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Ids: 33
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlySteps_merged.csv
Ids: 33
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  minuteCaloriesNarrow_merged.csv
Ids: 33
Total line count: 1325580
Missing values: 0
-------------------------------------------
archive:  minuteCaloriesWide_merged.csv
Ids: 33
Total line count: 21645
Missing values: 0
-------------------------------------------
archive:  minuteIntensitiesNarrow_merged.csv
Ids: 33
Total line count: 1325580
Missing values: 0
-------------------------------------------
archive:  minuteIntensitiesWide_merged.csv
Ids: 33
Total line count: 21645
Missing values: 0
-------------------------------------------
archive:  minuteMETsNarrow_merged.csv
Ids: 33
Total line count: 1325573
Missing values: 0
-------------------------------------------
archive:  minuteSleep_merged.csv
Ids: 24
Total line count: 187978
Missing values: 0
-------------------------------------------
archive:  minuteStepsNarrow_merged.csv
Ids: 33
Total line count: 1325580
Missing values: 0
-------------------------------------------
archive:  minuteStepsWide_merged.csv
Ids: 33
Total line count: 21645
Missing values: 0
-------------------------------------------
archive:  sleepDay_merged.csv
Ids: 24
Total line count: 410
Missing values: 0
-------------------------------------------
archive:  weightLogInfo_merged.csv
Ids: 8
Total line count: 67
Missing values: 0
-------------------------------------------

Data: https://www.kaggle.com/datasets/arashnic/fitbit?resource=download

Process phase:¶

As you already can see, I will use a Jupyter Notebook to perform this analysis.

First of all, I downloaded the data and put them into a folder called 'Data_Coursera_CaseStudy02', so I will read every file from there.

To ensure that the data is clean, I will perform this steps in every archive:

  1. Checked for duplicates;
  2. Checked for rows with no data;
  3. Checked for inconsistent data;

Is there any duplicated value in the files?

In [5]:
for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Duplicated lines: {new_df.duplicated().sum()}')
  print("-------------------------------------------")
archive:  dailyActivity_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  dailyCalories_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  dailyIntensities_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  dailySteps_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  hourlyCalories_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  hourlySteps_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteCaloriesNarrow_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteCaloriesWide_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteIntensitiesNarrow_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteIntensitiesWide_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteMETsNarrow_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteSleep_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteStepsNarrow_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  minuteStepsWide_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  sleepDay_merged.csv
Duplicated lines: 0
-------------------------------------------
archive:  weightLogInfo_merged.csv
Duplicated lines: 0
-------------------------------------------

Duplicated rows were founded in two of the datasets:


archive: minuteSleep_merged.csv

Duplicated lines: 543


archive: sleepDay_merged.csv

Duplicated lines: 3


Removing duplicates

In [6]:
# Removing duplicates
for archive in archives:
  new_df = pd.read_csv(archive)
  if(new_df.duplicated().sum() > 0 ):
    print(f"Removing the duplicates from: {archive.replace('Data_Coursera_CaseStudy02/',' ')}")
    new_df.drop_duplicates(keep='first',inplace=True)
    # Saving the Dataframe without the duplicated values
    new_df.to_csv(archive)

# Removing the duplicates from:  minuteSleep_merged.csv
# Removing the duplicates from:  sleepDay_merged.csv

Is there any missing value in the files?¶

In [7]:
# Searching for rows or column with no data:


for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Total line count: {len(new_df)}')
  print(f'Missing values: {new_df.isnull().any(axis=1).sum()}')
  print("-------------------------------------------") 
archive:  dailyActivity_merged.csv
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  dailyCalories_merged.csv
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailyIntensities_merged.csv
Total line count: 940
Missing values: 0
-------------------------------------------
archive:  dailySteps_merged.csv
Total line count: 863
Missing values: 0
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Total line count: 2483658
Missing values: 0
-------------------------------------------
archive:  hourlyCalories_merged.csv
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  hourlySteps_merged.csv
Total line count: 22099
Missing values: 0
-------------------------------------------
archive:  minuteCaloriesNarrow_merged.csv
Total line count: 1325580
Missing values: 0
-------------------------------------------
archive:  minuteCaloriesWide_merged.csv
Total line count: 21645
Missing values: 0
-------------------------------------------
archive:  minuteIntensitiesNarrow_merged.csv
Total line count: 1325580
Missing values: 0
-------------------------------------------
archive:  minuteIntensitiesWide_merged.csv
Total line count: 21645
Missing values: 0
-------------------------------------------
archive:  minuteMETsNarrow_merged.csv
Total line count: 1325573
Missing values: 0
-------------------------------------------
archive:  minuteSleep_merged.csv
Total line count: 187978
Missing values: 0
-------------------------------------------
archive:  minuteStepsNarrow_merged.csv
Total line count: 1325580
Missing values: 0
-------------------------------------------
archive:  minuteStepsWide_merged.csv
Total line count: 21645
Missing values: 0
-------------------------------------------
archive:  sleepDay_merged.csv
Total line count: 410
Missing values: 0
-------------------------------------------
archive:  weightLogInfo_merged.csv
Total line count: 67
Missing values: 0
-------------------------------------------

Missing values detected in:


archive: weightLogInfo_merged.csv

Total line count: 67

Missing values: 65


But this feels strange.

Have checked the data, and that is a column with only 2 values.

Need to remove that column, because we will not use that data.

Removing the column 'Fat' from 'weightLogInfo_merged.csv' :

Evaluating for inconsistent data:¶

In [8]:
for archive in archives:
  new_df = pd.read_csv(archive)

  print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
  print(f'Description:')
  print(new_df.describe())
  print("-------------------------------------------") 
archive:  dailyActivity_merged.csv
Description:
       Unnamed: 0.1  Unnamed: 0            Id    TotalSteps  TotalDistance  \
count    863.000000  863.000000  8.630000e+02    863.000000     863.000000   
mean     431.000000  470.404403  4.857542e+09   8319.392816       5.979513   
std      249.270937  270.542606  2.418405e+09   4744.967224       3.721044   
min        0.000000    0.000000  1.503960e+09      4.000000       0.000000   
25%      215.500000  240.500000  2.320127e+09   4923.000000       3.370000   
50%      431.000000  471.000000  4.445115e+09   8053.000000       5.590000   
75%      646.500000  708.500000  6.962181e+09  11092.500000       7.900000   
max      862.000000  939.000000  8.877689e+09  36019.000000      28.030001   

       TrackerDistance  LoggedActivitiesDistance  VeryActiveDistance  \
count       863.000000                863.000000          863.000000   
mean          5.963882                  0.117822            1.636756   
std           3.703191                  0.646111            2.735289   
min           0.000000                  0.000000            0.000000   
25%           3.370000                  0.000000            0.000000   
50%           5.590000                  0.000000            0.410000   
75%           7.880000                  0.000000            2.275000   
max          28.030001                  4.942142           21.920000   

       ModeratelyActiveDistance  LightActiveDistance  SedentaryActiveDistance  \
count                863.000000           863.000000               863.000000   
mean                   0.618181             3.638899                 0.001750   
std                    0.905049             1.857503                 0.007651   
min                    0.000000             0.000000                 0.000000   
25%                    0.000000             2.345000                 0.000000   
50%                    0.310000             3.580000                 0.000000   
75%                    0.865000             4.895000                 0.000000   
max                    6.480000            10.710000                 0.110000   

       VeryActiveMinutes  FairlyActiveMinutes  LightlyActiveMinutes  \
count         863.000000           863.000000            863.000000   
mean           23.015064            14.775203            210.016222   
std            33.646118            20.427405             96.781296   
min             0.000000             0.000000              0.000000   
25%             0.000000             0.000000            146.500000   
50%             7.000000             8.000000            208.000000   
75%            35.000000            21.000000            272.000000   
max           210.000000           143.000000            518.000000   

       SedentaryMinutes     Calories  
count        863.000000   863.000000  
mean         955.753187  2361.295481  
std          280.293359   702.711148  
min            0.000000    52.000000  
25%          721.500000  1855.500000  
50%         1021.000000  2220.000000  
75%         1189.000000  2832.000000  
max         1440.000000  4900.000000  
-------------------------------------------
archive:  dailyCalories_merged.csv
Description:
                 Id     Calories
count  9.400000e+02   940.000000
mean   4.855407e+09  2303.609574
std    2.424805e+09   718.166862
min    1.503960e+09     0.000000
25%    2.320127e+09  1828.500000
50%    4.445115e+09  2134.000000
75%    6.962181e+09  2793.250000
max    8.877689e+09  4900.000000
-------------------------------------------
archive:  dailyIntensities_merged.csv
Description:
                 Id  SedentaryMinutes  LightlyActiveMinutes  \
count  9.400000e+02        940.000000            940.000000   
mean   4.855407e+09        991.210638            192.812766   
std    2.424805e+09        301.267437            109.174700   
min    1.503960e+09          0.000000              0.000000   
25%    2.320127e+09        729.750000            127.000000   
50%    4.445115e+09       1057.500000            199.000000   
75%    6.962181e+09       1229.500000            264.000000   
max    8.877689e+09       1440.000000            518.000000   

       FairlyActiveMinutes  VeryActiveMinutes  SedentaryActiveDistance  \
count           940.000000         940.000000               940.000000   
mean             13.564894          21.164894                 0.001606   
std              19.987404          32.844803                 0.007346   
min               0.000000           0.000000                 0.000000   
25%               0.000000           0.000000                 0.000000   
50%               6.000000           4.000000                 0.000000   
75%              19.000000          32.000000                 0.000000   
max             143.000000         210.000000                 0.110000   

       LightActiveDistance  ModeratelyActiveDistance  VeryActiveDistance  
count           940.000000                940.000000          940.000000  
mean              3.340819                  0.567543            1.502681  
std               2.040655                  0.883580            2.658941  
min               0.000000                  0.000000            0.000000  
25%               1.945000                  0.000000            0.000000  
50%               3.365000                  0.240000            0.210000  
75%               4.782500                  0.800000            2.052500  
max              10.710000                  6.480000           21.920000  
-------------------------------------------
archive:  dailySteps_merged.csv
Description:
       Unnamed: 0.1  Unnamed: 0            Id     StepTotal
count    863.000000  863.000000  8.630000e+02    863.000000
mean     431.000000  470.404403  4.857542e+09   8319.392816
std      249.270937  270.542606  2.418405e+09   4744.967224
min        0.000000    0.000000  1.503960e+09      4.000000
25%      215.500000  240.500000  2.320127e+09   4923.000000
50%      431.000000  471.000000  4.445115e+09   8053.000000
75%      646.500000  708.500000  6.962181e+09  11092.500000
max      862.000000  939.000000  8.877689e+09  36019.000000
-------------------------------------------
archive:  heartrate_seconds_merged.csv
Description:
                 Id         Value
count  2.483658e+06  2.483658e+06
mean   5.513765e+09  7.732842e+01
std    1.950224e+09  1.940450e+01
min    2.022484e+09  3.600000e+01
25%    4.388162e+09  6.300000e+01
50%    5.553957e+09  7.300000e+01
75%    6.962181e+09  8.800000e+01
max    8.877689e+09  2.030000e+02
-------------------------------------------
archive:  hourlyCalories_merged.csv
Description:
                 Id      Calories
count  2.209900e+04  22099.000000
mean   4.848235e+09     97.386760
std    2.422500e+09     60.702622
min    1.503960e+09     42.000000
25%    2.320127e+09     63.000000
50%    4.445115e+09     83.000000
75%    6.962181e+09    108.000000
max    8.877689e+09    948.000000
-------------------------------------------
archive:  hourlyIntensities_merged.csv
Description:
                 Id  TotalIntensity  AverageIntensity
count  2.209900e+04    22099.000000      22099.000000
mean   4.848235e+09       12.035341          0.200589
std    2.422500e+09       21.133110          0.352219
min    1.503960e+09        0.000000          0.000000
25%    2.320127e+09        0.000000          0.000000
50%    4.445115e+09        3.000000          0.050000
75%    6.962181e+09       16.000000          0.266667
max    8.877689e+09      180.000000          3.000000
-------------------------------------------
archive:  hourlySteps_merged.csv
Description:
                 Id     StepTotal
count  2.209900e+04  22099.000000
mean   4.848235e+09    320.166342
std    2.422500e+09    690.384228
min    1.503960e+09      0.000000
25%    2.320127e+09      0.000000
50%    4.445115e+09     40.000000
75%    6.962181e+09    357.000000
max    8.877689e+09  10554.000000
-------------------------------------------
archive:  minuteCaloriesNarrow_merged.csv
Description:
                 Id      Calories
count  1.325580e+06  1.325580e+06
mean   4.847898e+09  1.623130e+00
std    2.422313e+09  1.410447e+00
min    1.503960e+09  0.000000e+00
25%    2.320127e+09  9.357000e-01
50%    4.445115e+09  1.217600e+00
75%    6.962181e+09  1.432700e+00
max    8.877689e+09  1.974995e+01
-------------------------------------------
archive:  minuteCaloriesWide_merged.csv
Description:
                 Id    Calories00    Calories01    Calories02    Calories03  \
count  2.164500e+04  21645.000000  21645.000000  21645.000000  21645.000000   
mean   4.836965e+09      1.622629      1.626377      1.637824      1.635515   
std    2.424088e+09      1.398418      1.395083      1.408828      1.419590   
min    1.503960e+09      0.702700      0.702700      0.702700      0.702700   
25%    2.320127e+09      0.935700      0.935700      0.937680      0.935700   
50%    4.445115e+09      1.217600      1.217600      1.220400      1.218500   
75%    6.962181e+09      1.432700      1.432700      1.432700      1.432700   
max    8.877689e+09     19.727337     19.727337     19.727337     19.727337   

         Calories04    Calories05    Calories06    Calories07    Calories08  \
count  21645.000000  21645.000000  21645.000000  21645.000000  21645.000000   
mean       1.637997      1.638306      1.639910      1.629520      1.623686   
std        1.433532      1.438253      1.435465      1.424092      1.411596   
min        0.702700      0.702700      0.702700      0.702700      0.702700   
25%        0.935700      0.935700      0.935700      0.935700      0.935700   
50%        1.218500      1.218500      1.218500      1.217600      1.217600   
75%        1.432700      1.432700      1.432700      1.432700      1.432700   
max       19.727337     19.727337     19.727337     19.727337     19.727337   

       ...    Calories50    Calories51    Calories52    Calories53  \
count  ...  21645.000000  21645.000000  21645.000000  21645.000000   
mean   ...      1.623665      1.613643      1.620958      1.618227   
std    ...      1.407171      1.395206      1.407914      1.400498   
min    ...      0.702700      0.702700      0.702700      0.702700   
25%    ...      0.935700      0.935700      0.935700      0.935700   
50%    ...      1.217600      1.217600      1.217600      1.217600   
75%    ...      1.432700      1.432700      1.432700      1.432700   
max    ...     19.749947     19.749947     19.749947     19.749947   

         Calories54    Calories55    Calories56    Calories57    Calories58  \
count  21645.000000  21645.000000  21645.000000  21645.000000  21645.000000   
mean       1.621229      1.615972      1.608714      1.612657      1.611715   
std        1.408974      1.392530      1.376827      1.369097      1.374954   
min        0.702700      0.702700      0.702700      0.702700      0.702700   
25%        0.935700      0.935700      0.935700      0.935700      0.935700   
50%        1.217600      1.217600      1.217600      1.217600      1.217600   
75%        1.432700      1.432700      1.432700      1.432700      1.432700   
max       19.749947     19.749947     19.727337     19.727337     19.727337   

         Calories59  
count  21645.000000  
mean       1.612110  
std        1.373888  
min        0.000000  
25%        0.935700  
50%        1.217600  
75%        1.432700  
max       19.727337  

[8 rows x 61 columns]
-------------------------------------------
archive:  minuteIntensitiesNarrow_merged.csv
Description:
                 Id     Intensity
count  1.325580e+06  1.325580e+06
mean   4.847898e+09  2.005937e-01
std    2.422313e+09  5.190227e-01
min    1.503960e+09  0.000000e+00
25%    2.320127e+09  0.000000e+00
50%    4.445115e+09  0.000000e+00
75%    6.962181e+09  0.000000e+00
max    8.877689e+09  3.000000e+00
-------------------------------------------
archive:  minuteIntensitiesWide_merged.csv
Description:
                 Id   Intensity00   Intensity01   Intensity02   Intensity03  \
count  2.164500e+04  21645.000000  21645.000000  21645.000000  21645.000000   
mean   4.836965e+09      0.199723      0.203326      0.208177      0.203835   
std    2.424088e+09      0.509819      0.515432      0.521394      0.518137   
min    1.503960e+09      0.000000      0.000000      0.000000      0.000000   
25%    2.320127e+09      0.000000      0.000000      0.000000      0.000000   
50%    4.445115e+09      0.000000      0.000000      0.000000      0.000000   
75%    6.962181e+09      0.000000      0.000000      0.000000      0.000000   
max    8.877689e+09      3.000000      3.000000      3.000000      3.000000   

        Intensity04   Intensity05   Intensity06   Intensity07   Intensity08  \
count  21645.000000  21645.000000  21645.000000  21645.000000  21645.000000   
mean       0.205082      0.204897      0.206560      0.201894      0.202310   
std        0.521956      0.521054      0.523053      0.519074      0.522594   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        0.000000      0.000000      0.000000      0.000000      0.000000   
75%        0.000000      0.000000      0.000000      0.000000      0.000000   
max        3.000000      3.000000      3.000000      3.000000      3.000000   

       ...   Intensity50   Intensity51   Intensity52   Intensity53  \
count  ...  21645.000000  21645.000000  21645.000000  21645.000000   
mean   ...      0.201016      0.195796      0.198337      0.199399   
std    ...      0.514814      0.510299      0.511264      0.513331   
min    ...      0.000000      0.000000      0.000000      0.000000   
25%    ...      0.000000      0.000000      0.000000      0.000000   
50%    ...      0.000000      0.000000      0.000000      0.000000   
75%    ...      0.000000      0.000000      0.000000      0.000000   
max    ...      3.000000      3.000000      3.000000      3.000000   

        Intensity54   Intensity55   Intensity56   Intensity57   Intensity58  \
count  21645.000000  21645.000000  21645.000000  21645.000000  21645.000000   
mean       0.200139      0.198753      0.195565      0.199122      0.198244   
std        0.512142      0.511238      0.506435      0.511907      0.510124   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        0.000000      0.000000      0.000000      0.000000      0.000000   
75%        0.000000      0.000000      0.000000      0.000000      0.000000   
max        3.000000      3.000000      3.000000      3.000000      3.000000   

        Intensity59  
count  21645.000000  
mean       0.195426  
std        0.503423  
min        0.000000  
25%        0.000000  
50%        0.000000  
75%        0.000000  
max        3.000000  

[8 rows x 61 columns]
-------------------------------------------
archive:  minuteMETsNarrow_merged.csv
Description:
       Unnamed: 0.1    Unnamed: 0            Id          METs
count  1.325573e+06  1.325573e+06  1.325573e+06  1.325573e+06
mean   6.627860e+05  6.627889e+05  4.847895e+09  1.469009e+01
std    3.826601e+05  3.826619e+05  2.422313e+09  1.205539e+01
min    0.000000e+00  0.000000e+00  1.503960e+09  6.000000e+00
25%    3.313930e+05  3.313950e+05  2.320127e+09  1.000000e+01
50%    6.627860e+05  6.627880e+05  4.445115e+09  1.000000e+01
75%    9.941790e+05  9.941820e+05  6.962181e+09  1.100000e+01
max    1.325572e+06  1.325579e+06  8.877689e+09  1.570000e+02
-------------------------------------------
archive:  minuteSleep_merged.csv
Description:
          Unnamed: 0            Id          value         logId
count  187978.000000  1.879780e+05  187978.000000  1.879780e+05
mean    94243.300796  4.997443e+09       1.095937  1.149589e+10
std     54499.126136  2.069872e+09       0.328912  6.820112e+07
min         0.000000  1.503960e+09       1.000000  1.137223e+10
25%     46994.250000  3.977334e+09       1.000000  1.143931e+10
50%     93988.500000  4.702922e+09       1.000000  1.150114e+10
75%    141525.750000  6.962181e+09       1.000000  1.155221e+10
max    188520.000000  8.792010e+09       3.000000  1.161625e+10
-------------------------------------------
archive:  minuteStepsNarrow_merged.csv
Description:
                 Id         Steps
count  1.325580e+06  1.325580e+06
mean   4.847898e+09  5.336192e+00
std    2.422313e+09  1.812830e+01
min    1.503960e+09  0.000000e+00
25%    2.320127e+09  0.000000e+00
50%    4.445115e+09  0.000000e+00
75%    6.962181e+09  0.000000e+00
max    8.877689e+09  2.200000e+02
-------------------------------------------
archive:  minuteStepsWide_merged.csv
Description:
                 Id       Steps00       Steps01       Steps02       Steps03  \
count  2.164500e+04  21645.000000  21645.000000  21645.000000  21645.000000   
mean   4.836965e+09      5.304366      5.335412      5.531439      5.469439   
std    2.424088e+09     17.783331     17.678358     18.079791     18.106414   
min    1.503960e+09      0.000000      0.000000      0.000000      0.000000   
25%    2.320127e+09      0.000000      0.000000      0.000000      0.000000   
50%    4.445115e+09      0.000000      0.000000      0.000000      0.000000   
75%    6.962181e+09      0.000000      0.000000      0.000000      0.000000   
max    8.877689e+09    186.000000    180.000000    182.000000    182.000000   

            Steps04       Steps05       Steps06       Steps07      Steps08  \
count  21645.000000  21645.000000  21645.000000  21645.000000  21645.00000   
mean       5.461862      5.590252      5.559483      5.412474      5.35879   
std       18.288469     18.565165     18.484912     18.335665     18.20523   
min        0.000000      0.000000      0.000000      0.000000      0.00000   
25%        0.000000      0.000000      0.000000      0.000000      0.00000   
50%        0.000000      0.000000      0.000000      0.000000      0.00000   
75%        0.000000      0.000000      0.000000      0.000000      0.00000   
max      181.000000    180.000000    181.000000    183.000000    180.00000   

       ...       Steps50       Steps51       Steps52       Steps53  \
count  ...  21645.000000  21645.000000  21645.000000  21645.000000   
mean   ...      5.329175      5.194456      5.225595      5.145484   
std    ...     17.870527     17.601857     17.618497     17.570195   
min    ...      0.000000      0.000000      0.000000      0.000000   
25%    ...      0.000000      0.000000      0.000000      0.000000   
50%    ...      0.000000      0.000000      0.000000      0.000000   
75%    ...      0.000000      0.000000      0.000000      0.000000   
max    ...    182.000000    181.000000    181.000000    181.000000   

            Steps54       Steps55       Steps56       Steps57       Steps58  \
count  21645.000000  21645.000000  21645.000000  21645.000000  21645.000000   
mean       5.223654      5.281220      5.179533      5.251836      5.143636   
std       17.684634     17.828413     17.569268     17.686583     17.427494   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        0.000000      0.000000      0.000000      0.000000      0.000000   
75%        0.000000      0.000000      0.000000      0.000000      0.000000   
max      184.000000    181.000000    182.000000    182.000000    180.000000   

            Steps59  
count  21645.000000  
mean       5.288935  
std       17.721454  
min        0.000000  
25%        0.000000  
50%        0.000000  
75%        0.000000  
max      189.000000  

[8 rows x 61 columns]
-------------------------------------------
archive:  sleepDay_merged.csv
Description:
       Unnamed: 0            Id  TotalSleepRecords  TotalMinutesAsleep  \
count  410.000000  4.100000e+02         410.000000          410.000000   
mean   205.643902  4.994963e+09           1.119512          419.173171   
std    119.470511  2.060863e+09           0.346636          118.635918   
min      0.000000  1.503960e+09           1.000000           58.000000   
25%    102.250000  3.977334e+09           1.000000          361.000000   
50%    205.500000  4.702922e+09           1.000000          432.500000   
75%    308.750000  6.962181e+09           1.000000          490.000000   
max    412.000000  8.792010e+09           3.000000          796.000000   

       TotalTimeInBed  
count      410.000000  
mean       458.482927  
std        127.455140  
min         61.000000  
25%        403.750000  
50%        463.000000  
75%        526.000000  
max        961.000000  
-------------------------------------------
archive:  weightLogInfo_merged.csv
Description:
       Unnamed: 0            Id    WeightKg  WeightPounds        BMI  \
count   67.000000  6.700000e+01   67.000000     67.000000  67.000000   
mean    33.000000  7.009282e+09   72.035821    158.811801  25.185224   
std     19.485037  1.950322e+09   13.923206     30.695415   3.066963   
min      0.000000  1.503960e+09   52.599998    115.963147  21.450001   
25%     16.500000  6.962181e+09   61.400002    135.363832  23.959999   
50%     33.000000  6.962181e+09   62.500000    137.788914  24.389999   
75%     49.500000  8.877689e+09   85.049999    187.503152  25.559999   
max     66.000000  8.877689e+09  133.500000    294.317120  47.540001   

              LogId  
count  6.700000e+01  
mean   1.461772e+12  
std    7.829948e+08  
min    1.460444e+12  
25%    1.461079e+12  
50%    1.461802e+12  
75%    1.462375e+12  
max    1.463098e+12  
-------------------------------------------

Evaluating inconsistent data¶


from the archive: dailyCalories_merged.csv

We can se that there is a minimum of 0 calories that someoned burned in a day. Ok, it's impossible. These data needs to be erased from the dataset.


From the archive: dailyIntensities_merged.csv

The sum of the columns: SedentaryMinutes + LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes must be 1440 (the total minutes of a day).

In some days, the total sum isn't equal to 1440, maybe this is because some battery discharge of the gadgets, or the data was altered.

Evaluating the number of rows that doesn't sum 1440, we keep with 462 (49.1% of the data). So I will continue with this data


From the archive: dailyActivity_merged.csv

All the days with 0 steps will be excluded. Probably on those days the volunteers did not use the gadget. There are 77 rows with 0 TotalSteps data.


From the archive: dailySteps_merged.csv

The same situation as the archive above. Deleted the data with 0 steps.


From the archive: hourlySteps_merged.csv

Here, the 0 steps value has significant, but some days with a total of 0 values no. Days that sum a total of 0 steps will be deleted, since they are of no use to us (Maybe these days make difference in the analysis phase, so I will keep with them and come back later if I didn't find anything).


From the archive: minuteMETsNarrow_merged.csv

In some reasearches and "some" help of ChatGPT (that gave me the wrong information), there are no activity that we, human, execute that costs less than 0.95.

From the Compendium of Physical Activities:

https://sites.google.com/site/compendiumofphysicalactivities/Activity-Categories/inactivity?authuser=0

Sleep has a value of 0.95 METs. So, every 0 METs value in the dataset will be deleted.

To anyone that want to know what MET is, MET is the acronym to Metabolic Equivalent of Task. It is a unit that measures how much energy an activity consumes compared to being at rest.

I get this information from: https://www.omnicalculator.com/sports/met-minutes-per-week#met-definition


From the archive: weightLogInfo_merged.csv

We must remove the column Fat, since has only 2 values.

In [27]:
# Analyzing dailyIntensities_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailyIntensities_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis['Sum'] = df_analysis['SedentaryMinutes'] + df_analysis['LightlyActiveMinutes'] + df_analysis['FairlyActiveMinutes'] + df_analysis['VeryActiveMinutes']
len(df_analysis[df_analysis['Sum']!=1440])
Out[27]:
462
In [9]:
# Analyzing dailyActivity_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailyActivity_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis
# df_analysis['TotalSteps'] = df_analysis['SedentaryMinutes'] + df_analysis['LightlyActiveMinutes'] + df_analysis['FairlyActiveMinutes'] + df_analysis['VeryActiveMinutes']
df_analysis = df_analysis[df_analysis['TotalSteps']!=0]

df_analysis.to_csv( 'Data_Coursera_CaseStudy02/dailyActivity_merged.csv')
In [10]:
# Analyzing dailySteps_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailySteps_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis

df_analysis = df_analysis[df_analysis['StepTotal']!=0]

df_analysis.to_csv( 'Data_Coursera_CaseStudy02/dailySteps_merged.csv')
In [11]:
# Analyzing minuteMETsNarrow_merged.csv
dataset = 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv'
df_analysis = pd.read_csv(dataset)
# df_analysis
# df_analysis['TotalSteps'] = df_analysis['SedentaryMinutes'] + df_analysis['LightlyActiveMinutes'] + df_analysis['FairlyActiveMinutes'] + df_analysis['VeryActiveMinutes']
df_analysis =df_analysis[df_analysis['METs']!=0]

df_analysis.to_csv( 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv')
In [12]:
# Removing the column 'Fat' from 'weightLogInfo_merged.csv' :
dataset = 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv'
df_drop_column = pd.read_csv(dataset)
df_drop_column.drop(columns=['Fat'],inplace=True)
df_drop_column.to_csv(dataset)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[12], line 4
      2 dataset = 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv'
      3 df_drop_column = pd.read_csv(dataset)
----> 4 df_drop_column.drop(columns=['Fat'],inplace=True)
      5 df_drop_column.to_csv(dataset)

File ~\anaconda3\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~\anaconda3\lib\site-packages\pandas\core\frame.py:5399, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   5251 @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"])
   5252 def drop(  # type: ignore[override]
   5253     self,
   (...)
   5260     errors: IgnoreRaise = "raise",
   5261 ) -> DataFrame | None:
   5262     """
   5263     Drop specified labels from rows or columns.
   5264 
   (...)
   5397             weight  1.0     0.8
   5398     """
-> 5399     return super().drop(
   5400         labels=labels,
   5401         axis=axis,
   5402         index=index,
   5403         columns=columns,
   5404         level=level,
   5405         inplace=inplace,
   5406         errors=errors,
   5407     )

File ~\anaconda3\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~\anaconda3\lib\site-packages\pandas\core\generic.py:4505, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4503 for axis, labels in axes.items():
   4504     if labels is not None:
-> 4505         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4507 if inplace:
   4508     self._update_inplace(obj)

File ~\anaconda3\lib\site-packages\pandas\core\generic.py:4546, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
   4544         new_axis = axis.drop(labels, level=level, errors=errors)
   4545     else:
-> 4546         new_axis = axis.drop(labels, errors=errors)
   4547     indexer = axis.get_indexer(new_axis)
   4549 # Case for non-unique axis
   4550 else:

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:6934, in Index.drop(self, labels, errors)
   6932 if mask.any():
   6933     if errors != "ignore":
-> 6934         raise KeyError(f"{list(labels[mask])} not found in axis")
   6935     indexer = indexer[~mask]
   6936 return self.delete(indexer)

KeyError: "['Fat'] not found in axis"

Analyze and Share phase:¶

OK. Now we need to develop our hypothesis on what could this data show to us.

Average Number of steps¶

In [13]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")
Average number of steps taken by day: 8319.39281575898 steps

Doing a research, to be considered active, a person should do 10.000 steps in a day. Of course in our lives we have to balancing active life with our jobs, but seeing this averave from all the users, we can ensure that most of them are not active people.

Average Calories burned by day¶

In [ ]:
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])
print(f"Average Calories by day: {np.average(caloriesDay['Calories'])} calories")
Average Calories by day: 2303.609574468085 calories

This value of burned calories in a day is in line with what research shows that both men and adult women must burn daily to maintain their weight

Average Time in Activities¶

In [ ]:
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
print(f"Average Time by day in lightly activities: {np.average(dailyActivity['LightlyActiveMinutes']):.2f} minutes")
print(f"Average Time by day in fairly activities: {np.average(dailyActivity['FairlyActiveMinutes']):.2f} minutes")
print(f"Average Time by day in very active activities: {np.average(dailyActivity['VeryActiveMinutes']):.2f} minutes")
Average Time by day in lightly activities: 210.02 minutes
Average Time by day in fairly activities: 14.78 minutes
Average Time by day in very active activities: 23.02 minutes

In a research with the help of the Bing AI:

According to the World Health Organization (WHO), adults should spend at least 180 minutes in a variety of types of physical activities at any intensity, of which at least 60 minutes is moderate- to vigorous-intensity physical activity, spread throughout the day; more is better. The Centers for Disease Control and Prevention (CDC) recommends that adults need 150 minutes of moderate-intensity physical activity and 2 days of muscle strengthening activity each week.

So, we can see that the average is about 210 minuites in lightly activities, that can maybe be related with work. So, we can assure that most of this users need to take more time to do more fairly and active activities.

Steps through the day¶

In [14]:
# Bar chart: steps through the day

hourlySteps = pd.read_csv("Data_Coursera_CaseStudy02/hourlySteps_merged.csv", index_col=[0])
hourlySteps['ActivityHour'] = pd.to_datetime(hourlySteps['ActivityHour'])

hourlySteps['Day'] = hourlySteps['ActivityHour'].dt.date
hourlySteps['Time'] =  hourlySteps['ActivityHour'].dt.time


stepsMean_By_hour = []

for hour in (hourlySteps['Time'].unique()): 
  stepsMean_By_hour.append(np.average(hourlySteps[hourlySteps['Time']==hour]['StepTotal']))



# df['datetime'] = pd.to_datetime(df['datetime'])

data = {'hours': hourlySteps['Time'].unique(),
        'Average steps': stepsMean_By_hour}

fig = px.bar(data, x='hours', y='Average steps', title = 'Average steps by hour')

fig.update_layout(xaxis=dict(
                    title = 'Hours',
                    ), 
                  yaxis=dict(
                    title='Average steps',
                    side='left'                 
                    )
)

fig.show()

Here we can see something that would be expected, most of the steps are in the period of the work times and after it, maybe time to go away or go to the gym?

Classification of users¶

Users with an average steps more than 10.000 were classified as active.

In [15]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])

dailySteps = dailySteps.groupby(by=['Id']).mean()

status_list = []

for id in dailySteps.index:
  if(dailySteps.loc[id]['StepTotal']>=10000):
    status_list.append('Active')
  else:
    status_list.append('Not Active')

dailySteps['Status'] = status_list

data = {'Status': ['Active', 'Not Active'],
        'number of status': [ len(dailySteps[dailySteps['Status']=='Active']), len(dailySteps[dailySteps['Status']=='Not Active'])]}


fig = px.pie(data, values='number of status', names='Status', title='Percentage of active users based on 10.000 steps daily')
fig.show()

# print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")
C:\Users\Spuck\AppData\Local\Temp\ipykernel_19412\663014179.py:3: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

As evidenced above in the average daily steps calculation. When we separate the calculation by user, we can see that almost 80% of them do not do 10.000 steps by day.

Intensities by hour¶

In [16]:
# Bar chart: intensities through the day

hourlyIntensities = pd.read_csv("Data_Coursera_CaseStudy02/hourlyIntensities_merged.csv", index_col=[0])
hourlyIntensities['ActivityHour'] = pd.to_datetime(hourlyIntensities['ActivityHour'])

hourlyIntensities['Day'] = hourlyIntensities['ActivityHour'].dt.date
hourlyIntensities['Time'] =  hourlyIntensities['ActivityHour'].dt.time


stepsMean_By_hour = []

for hour in (hourlyIntensities['Time'].unique()): 
  stepsMean_By_hour.append(np.average(hourlyIntensities[hourlyIntensities['Time']==hour]['TotalIntensity']))



# df['datetime'] = pd.to_datetime(df['datetime'])

data = {'hours': hourlyIntensities['Time'].unique(),
        'Average Intensities': stepsMean_By_hour}

fig = px.bar(data, x='hours', y='Average Intensities', title = 'Average Intensities by hour')

fig.update_layout(xaxis=dict(
                    title = 'Hours',
                    ), 
                  yaxis=dict(
                    title='Average Intensities',
                    side='left'                 
                    )
)

fig.show()

This follows the steps by hour. We can do the same analysis as the previous.

Calories through the day¶

In [17]:
# Bar chart: Average time by type of activity

hourlyCalories = pd.read_csv("Data_Coursera_CaseStudy02/hourlyCalories_merged.csv", index_col=[0])
hourlyCalories['ActivityHour'] = pd.to_datetime(hourlyCalories['ActivityHour'])

hourlyCalories['Day'] = hourlyCalories['ActivityHour'].dt.date
hourlyCalories['Time'] =  hourlyCalories['ActivityHour'].dt.time


caloriesAVG_By_hour = []

for hour in (hourlyCalories['Time'].unique()): 
  caloriesAVG_By_hour.append(np.average(hourlyCalories[hourlyCalories['Time']==hour]['Calories']))



# df['datetime'] = pd.to_datetime(df['datetime'])

data = {'hours': hourlyCalories['Time'].unique(),
        'Average calories': caloriesAVG_By_hour}

fig = px.bar(data, x='hours', y='Average calories', title = 'Average calories by hour')

fig.update_layout(xaxis=dict(
                    title = 'Hours',
                    ), 
                  yaxis=dict(
                    title='Average calories',
                    side='left'                 
                    )
)

fig.show()

Also, we can do the same analysis as the previous. The most calories are burned in the work hours.

Time Sleep vs Daily Steps¶

In [18]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
sleepDay.rename(columns={'SleepDay': 'Day'}, inplace=True)
dailySteps.rename(columns={'ActivityDay': 'Day'}, inplace=True)
In [19]:
# New dataframe 
df_analysis_sleep_steps = pd.merge(dailySteps, sleepDay, on = ['Id', 'Day'])
# df_analysis_sleep_steps
In [20]:
# Plot
fig = px.scatter(x=df_analysis_sleep_steps['TotalMinutesAsleep'], y=df_analysis_sleep_steps['StepTotal'], title="Minutes Asleep vs Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Minutes Asleep',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

In this plot we can see that most of the users concentrate between ~440 minutes of sleep and ~10k steps per day.

It doesn't appear to exist a relation in the two data.

Sleep Time vs Time in Bed¶

In [21]:
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
sleepDay

fig = px.scatter(x=sleepDay['TotalMinutesAsleep'], y=sleepDay['TotalTimeInBed'], title="Minutes Asleep vs Total time in Bed")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Minutes Asleep',
                      ), 
                    yaxis=dict(
                      title='Total Time in Bed',
                      side='left',
))


fig.show()

An exactly linear relation. But kind of obvius.

Sleep Records vs Daily Steps¶

In [22]:
# Plot
fig = px.scatter(x=df_analysis_sleep_steps['TotalSleepRecords'], y=df_analysis_sleep_steps['StepTotal'], title="Minutes Asleep vs Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Total Sleep Records',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

Also, nothing that we can confirm.

If we had more information of people with 3 sleeping records we could affirm that those who doesn't appear to sleep well tends to take less steps per day.

Calories vs Daily Steps¶

In [23]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
# sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
caloriesDay.rename(columns={'ActivityDay': 'Day'}, inplace=True)
dailySteps.rename(columns={'ActivityDay': 'Day'}, inplace=True)
In [24]:
# New dataframe 
df_analysis_calories_steps = pd.merge(dailySteps, caloriesDay, on = ['Id', 'Day'])
# df_analysis_sleep_steps
In [25]:
# Plot
fig = px.scatter(x=df_analysis_calories_steps['Calories'], y=df_analysis_calories_steps['StepTotal'], title="Calories vs Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Calories spent in a Day',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

A well correlation. The more steps you do in your day, the more calories are spent.

We have some outliers, but that can be:

  1. An error of the device;
  2. The outlier is counting calories just from the steps (and the calories of every other data point is a sum of steps and activities);

Time Activity vs Sleep Time¶

In [26]:
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
dailyActivity.rename(columns={'ActivityDate': 'Day'}, inplace=True)
sleepDay.rename(columns={'SleepDay': 'Day'}, inplace=True)
In [27]:
# Merging the Dataframes
df_analysis_calories_steps = pd.merge(dailyActivity, sleepDay, on = ['Id', 'Day'])
df_analysis_calories_steps
Out[27]:
Unnamed: 0.1 Unnamed: 0 Id Day TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
0 0 0 1503960366 4/12/2016 13162 8.50 8.50 0.0 1.88 0.55 6.06 0.0 25 13 328 728 1985 1 327 346
1 1 1 1503960366 4/13/2016 10735 6.97 6.97 0.0 1.57 0.69 4.71 0.0 21 19 217 776 1797 2 384 407
2 3 3 1503960366 4/15/2016 9762 6.28 6.28 0.0 2.14 1.26 2.83 0.0 29 34 209 726 1745 1 412 442
3 4 4 1503960366 4/16/2016 12669 8.16 8.16 0.0 2.71 0.41 5.04 0.0 36 10 221 773 1863 2 340 367
4 5 5 1503960366 4/17/2016 9705 6.48 6.48 0.0 3.19 0.78 2.51 0.0 38 20 164 539 1728 1 700 712
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
405 827 898 8792009665 4/30/2016 7174 4.59 4.59 0.0 0.33 0.36 3.91 0.0 10 20 301 749 2896 1 343 360
406 828 899 8792009665 5/1/2016 1619 1.04 1.04 0.0 0.00 0.00 1.04 0.0 0 0 79 834 1962 1 503 527
407 829 900 8792009665 5/2/2016 1831 1.17 1.17 0.0 0.00 0.00 1.17 0.0 0 0 101 916 2015 1 415 423
408 830 901 8792009665 5/3/2016 2421 1.55 1.55 0.0 0.00 0.00 1.55 0.0 0 0 156 739 2297 1 516 545
409 831 902 8792009665 5/4/2016 2283 1.46 1.46 0.0 0.00 0.00 1.46 0.0 0 0 129 848 2067 1 439 463

410 rows × 20 columns

In [38]:
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['LightlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Lightly Activities'),
    row=1, col=1,
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['FairlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Fairly Active'),
    row=1, col=2
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['VeryActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Very Active'),
    row=1, col=3
)

fig.update_layout(height=600, width=900, title_text="Activity Time vs Sleep Time ", hovermode="x unified")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Time spent in lightly activities',
                      ), 
                    yaxis=dict(
                      title='Time Slept',
                      side='left',
))

fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities')
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities')


# fig.update_traces(connectgaps=False)

fig.show()

Here we can see a preference to lightly activities instead of fairly or very active activities.

Again, most of those people sleep ~440 minutes a day (7 hours and 20 minutes).

Bute, we also have many points below 300 minutes (5 hours), and that is not healty.

In [39]:
# Density plots from the plot above: 

# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Histogram2dContour(x=df_analysis_calories_steps['LightlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Lightly Activities'),
    row=1, col=1,
)


fig.add_trace(
    go.Histogram2dContour(x=df_analysis_calories_steps['FairlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Fairly Active'),
    row=1, col=2
)


fig.add_trace(
    go.Histogram2dContour(x=df_analysis_calories_steps['VeryActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Very Active'),
    row=1, col=3
)

fig.update_layout(height=800, width=900, title_text="Activity Time vs Sleep Time ", hovermode="x unified")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Time spent in lightly activities',
                      range = [0,450]
                      ), 
                    yaxis=dict(
                      title='Time Slept',
                      side='left',
))

fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities', range = [-5.5,55])
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities', range = [-10,85])


# fig.update_traces(connectgaps=False)

Same as above, but with a density plot. We can see that the preference in lightly activities exists.

In [40]:
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['LightActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Lightly Activities Distance'),
    row=1, col=1,
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['ModeratelyActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Moderately Active Distance'),
    row=1, col=2
)


fig.add_trace(
    go.Scatter(x=df_analysis_calories_steps['VeryActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Very Active Distance'),
    row=1, col=3
)

fig.update_layout(height=600, width=900, title_text="Distance in activities vs Sleep Time ", hovermode="x unified")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Distance in lightly activities',
                      ), 
                    yaxis=dict(
                      title='Time Slept',
                      side='left',
))

fig.update_xaxes(row=1, col=2, title = 'Distance in moderately activities')
fig.update_xaxes(row=1, col=3, title = 'Distance in very active activities')


# fig.update_traces(connectgaps=False)

fig.show()

Same as Time in activities vs Sleep Time.

Calories vs time in activities¶

In [31]:
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])

# Remove the time data from sleepDay Dataframe
# sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')

# Renaming the columns
caloriesDay.rename(columns={'ActivityDay': 'Day'}, inplace=True)
# Renaming the columns
dailyActivity.rename(columns={'ActivityDate': 'Day'}, inplace=True)
In [32]:
# Merging the Dataframes
df_analysis_calories_activities = pd.merge(dailyActivity, caloriesDay, on = ['Id', 'Day'])
df_analysis_calories_activities
Out[32]:
Unnamed: 0.1 Unnamed: 0 Id Day TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories_x Calories_y
0 0 0 1503960366 4/12/2016 13162 8.500000 8.500000 0.0 1.88 0.55 6.06 0.00 25 13 328 728 1985 1985
1 1 1 1503960366 4/13/2016 10735 6.970000 6.970000 0.0 1.57 0.69 4.71 0.00 21 19 217 776 1797 1797
2 2 2 1503960366 4/14/2016 10460 6.740000 6.740000 0.0 2.44 0.40 3.91 0.00 30 11 181 1218 1776 1776
3 3 3 1503960366 4/15/2016 9762 6.280000 6.280000 0.0 2.14 1.26 2.83 0.00 29 34 209 726 1745 1745
4 4 4 1503960366 4/16/2016 12669 8.160000 8.160000 0.0 2.71 0.41 5.04 0.00 36 10 221 773 1863 1863
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
858 858 935 8877689391 5/8/2016 10686 8.110000 8.110000 0.0 1.08 0.20 6.80 0.00 17 4 245 1174 2847 2847
859 859 936 8877689391 5/9/2016 20226 18.250000 18.250000 0.0 11.10 0.80 6.24 0.05 73 19 217 1131 3710 3710
860 860 937 8877689391 5/10/2016 10733 8.150000 8.150000 0.0 1.35 0.46 6.28 0.00 18 11 224 1187 2832 2832
861 861 938 8877689391 5/11/2016 21420 19.559999 19.559999 0.0 13.22 0.41 5.89 0.00 88 12 213 1127 3832 3832
862 862 939 8877689391 5/12/2016 8064 6.120000 6.120000 0.0 1.82 0.04 4.25 0.00 23 1 137 770 1849 1849

863 rows × 18 columns

In [41]:
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)

fig.add_trace(
    go.Scatter(x=dailyActivity['LightlyActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Lightly Activities'),
    row=1, col=1,
)


fig.add_trace(
    go.Scatter(x=dailyActivity['FairlyActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Fairly Active'),
    row=1, col=2
)


fig.add_trace(
    go.Scatter(x=dailyActivity['VeryActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Very Active'),
    row=1, col=3
)

fig.update_layout(height=600, width=900, title_text="Activity Time vs Calories", hovermode="x unified")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'Time spent in lightly activities',
                      ), 
                    yaxis=dict(
                      title='Calories',
                      side='left',
))

fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities')
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities')


# fig.update_traces(connectgaps=False)

fig.show()

Average time by type of activity¶

In [34]:
# Bar chart: Average time by type of activity

data = {'labels': ['Ligthly', 'Fairly', 'Very active'],
        'Mean time in activities': [np.average(dailyActivity['LightlyActiveMinutes']), np.average(dailyActivity['FairlyActiveMinutes']), np.average(dailyActivity['VeryActiveMinutes'])]}

fig = px.bar(data, x='labels', y='Mean time in activities', title = 'Average time by type of activity')

fig.update_layout(xaxis=dict(
                    title = 'Type of Activity',
                    ), 
                  yaxis=dict(
                    title='Mean time (minutes)',
                    side='left'                 
                    )
)

fig.show()

AVG Steps vs Body Mass Index¶

In [35]:
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
weightLogInfo = pd.read_csv("Data_Coursera_CaseStudy02/weightLogInfo_merged.csv", index_col=[0])



dailySteps = dailySteps.groupby(by=['Id']).mean()
dailySteps.reset_index(inplace=True)

df_analysis_steps_BMI = pd.merge(dailySteps, weightLogInfo, on = ['Id'])
df_analysis_steps_BMI


fig = px.scatter(x=df_analysis_steps_BMI['BMI'], y=df_analysis_steps_BMI['StepTotal'], title="BMI vs Average Steps total")

fig.update_layout(legend=dict(
                      orientation="h",
                      yanchor="bottom",
                      y=1.02,
                      xanchor="right",
                      x=1,
                      title=''
                  ),xaxis=dict(
                      title = 'BMI (Body Mass Index)',
                      ), 
                    yaxis=dict(
                      title='Total Steps in a day',
                      side='left',
))


fig.show()

# data = {'Status': ['Active', 'Not Active'],
#         'number of status': [ len(dailySteps[dailySteps['Status']=='Active']), len(dailySteps[dailySteps['Status']=='Not Active'])]}


# fig = px.pie(data, values='number of status', names='Status', title='Percentage of active users based on 10.000 steps daily')
# fig.show()

# print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")
C:\Users\Spuck\AppData\Local\Temp\ipykernel_19412\1163727648.py:6: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

Ok. We do not have much data, but we can presume that people with BMI higher than 40 really has difficulty to do more exercises.

Act phase¶

It is important to note that the data analyzed in this study only represents a partial sample of Bellabeat clients, and therefore cannot be generalized to the entire population. The sample size of 7~33 people is not large enough to assume that the behaviors observed are representative of the broader population. As such, any conclusions or insights drawn from this data should be considered with caution and should not be applied to the entire Bellabeat client base without further research and analysis. It is important to gather a more significant sample size to accurately reflect the population's behavior and make informed decisions based on the data collected.

Insigths from the data:¶

  • Most os these users can't be considered active, with less than 10.000 steps per day;

    • The users want to know data about their health, and for that, use the gadgets. But, even so, they can't manages to change their habits to become more active.
  • Some users do not sleep well;

    • We could see a lot of data points below 300 minutes a day.
  • Most of the users do not have a habit to do exercises;

    • The average time spent with lightly activities may be related to the job, and the average with the others activities are too low to be considered to be healthy.
  • The most active time is from 17h to 20h;

    • This time could be the time of leaving work and going to the gym.
  • People with high BMI tends to make less exercise;

Recomendations¶

Based on the trends and hypothesis that has been found, I recommend create new features to the ibtegration of Bellabeat Leaf with Bellabeat App and/ or Bellabeat membership.

  • New sleeping traking system:

    • To help users improve their sleep quality, the Bellabeat app could implement a notification system that sends reminders to users who have experienced poor sleep quality. These notifications could provide helpful tips and suggestions related to sleep hygiene, such as establishing a regular bedtime routine or creating a relaxing sleep environment. Additionally, the app could offer guided meditations or breathing exercises to help users relax before bed and improve the quality of their sleep.
  • Game mode of steps daily task and activity time (with rewards):

    • Setting daily step goals can be an effective way to motivate Bellabeat users to walk more. By establishing a daily target, users can track their progress and see how close they are to achieving their goal. This can create a sense of accomplishment and provide a tangible incentive to stay active throughout the day. The Bellabeat app could provide users with personalized step targets based on their current activity levels, and offer rewards or incentives for achieving milestones or consistently meeting their goals.
  • Social media share options on app:

    • Setting options to share the acomplishment of tasks in the actual days should have an excelent impact and incentive people to stay active. An environment of competition and companionship can be created once groups could be formed, and the users can start to help each other.

In conclusion, I believe that more research and analysis is needed to better understand the specific needs and behaviors of Bellabeat users in order to continue to improve and tailor the product to meet their needs.